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Abstract 

This paper addresses the problem of detecting and 
describing anomalies in large sets of high-dimensional 
symbol sequences. 1 The approach taken uses unsu- 
pervised clustering of sequences using the normalized 
longest common subsequence (LCS) as a similarity 
measure, followed by detailed analysis of outliers to 
detect anomalies. As the LCS measure is expensive to 
compute, the first part of the paper discusses existing 
algorithms, such as the Hunt-Szymanski algorithm, 
that have low time-complexity. We then discuss why 
these algorithms often do not work well in practice 
and present a new hybrid algorithm for computing 
the LCS that, in our tests, outperforms the Hunt- 
Szymanski algorithm by a factor of five. The second 
part of the paper presents new algorithms for outlier 
analysis that provide comprehensible indicators as to 
why a particular sequence was deemed to be an out- 
lier. The algorithm provide a coherent description 
to an analyst of the anomalies in the sequence, com- 
pared to more ’normal’ sequences. The algorithms we 
present are general and domain-independent, so we 
discuss applications in related areas such as anomaly 
detection. 

1 Introduction 

We consider the problem of finding anomalies in a set 
of N discrete sequences A = {xi, x 2 , ..., xjv}, where 
each sequence x* = {xn, X&, ...Xi ni } where the Xij 
correspond to the j th symbol in the ith sequence. 
We assume all symbols are drawn from a finite but 
large alphabet A and that each sequence has a vari- 
able length rij. We define an anomaly in the sequence 

1 This work was supported by the NASA Aviation Safety 
Program, Aviation Systems Monitoring and Modeling element. 


to be a subsequence that differs significantly with re- 
spect to some measure from the majority of sequences 
in X. We use the normalized longest common sub- 
sequence as the measure underlying an unsupervised 
clustering algorithm. Those sequences that are ’far 
away’ from the majority of sequences in a cluster are 
called outliers or anomalies. 

Although this is a general problem in anomaly de- 
tection in large sequences, we have built this method- 
ology to address a key question in the aviation safety 
domain. We assume that we are given a set of se- 
quences X that correspond to N landings of a spe- 
cific aircraft model at a specific airport. The symbols 
that are recorded correspond to the switches in the 
cockpit of the airplane. As the pilot undergoes ma- 
neuvers to land the airplane, he or she flips switches 
and sets levers and other control mechanisms. The se- 
quence of switches that a pilot flips during the course 
of the landing phase of the flight corresponds to the 
sequence X{. Notice that in this scenario, since the 
duration of the landing phase can vary from flight 
to flight, and also since not all switches in the set A 
need to be flipped in each flight, the sequence length 
will be variable. We have tested our algorithms on se- 
quences where the size of the alphabet is |A| « 1000. 
A recent paper discusses the domain problem in more 
detail and elucidates some of the difficulties in ad- 
dressing it using standard methods [11]. An example 
of the kind of anomalies the system targets for detec- 
tion are mode awareness problems, such as confusion 
about the current state of cockpit automation. 

Previous approaches to the task of anomaly de- 
tection focus on continuous sensor data, and do not 
distinguish discrete sensors from continuous. In this 
process they ignore the non-continuous as well as the 
sequential nature of the discrete sensors. In compari- 
son, we focus on discrete sensors, specifically, sensors 
recording pilot actions, or switches. We are inter- 
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ested in the sequence in which the values for these 
sensors change during the course of a flight and find- 
ing anomalies in flight behavior based on this infor- 
mation. 

Our system performs two tasks, as part of the task 
of atypical events detection in flights: a) detection of 
atypical flights, b) finding events during the course of 
such flights that are anomalous or atypical. Task b) 
is important as each flight generates large amounts of 
data during its course, and simply identifying a flight 
as anomalous still leaves the problem of identifying 
the problem areas inside the flight unaddressed. 

The first step of our approach is to merge data 
from all discrete sensors into a single sequence. This 
is done by recording a sensor only when it makes a 
transition. This transformation provides us with an 
event sequence for each flight. Following this, in the 
absence of training data, we treat the problem of find- 
ing atypical flights as an unsupervised learning prob- 
lem. We first cluster the flights for each itinerary 
into groups, and identify the outliers in each clus- 
ter as atypical. We use the Longest Common Sub- 
sequence, a common measure in bioinformatics and 
Intrusion Detection systems, as the similarity mea- 
sure for clustering flight data. We then present two 
new algorithms that use Bayesian Networks to effi- 
ciently identify anomalous events during the course 
of the flight. 

We demonstrate the performance of these al- 
gorithms using operation information from about 
10,000 flights, and developing the base clusters and 
locating anomalous flights by using these sequences. 

A preliminary analysis of this problem may lead 
to a time-series based state-space approach, where 
at each time step, the current state of all symbols 
in the alphabet is recorded. For each sequence, we 
would obtain a matrix of size (n,; x |A)| in size. A 
decomposition of this matrix through an SVD ap- 
proach could be used to generate a list of anomalies. 
The problem with this, and most other state-space 
approaches, is that the results are independent of the 
order of the data. Thus, the inherent pointers of phe- 
nomenon such as mode confusion can be lost. 

Another approach that could be taken is to build 
Hidden Markov Models on the data sets and then an- 
alyze the likelihood of a particular sequence. While 
this approach maintains the sensitivity to the sequen- 
tial nature of the data, it does not lend itself to in- 
terpretable models. Thus, we develop a new class of 
algorithms to detect anomalies and to present them 
in a comprehensible manner. 


The area most closely related to the problem of 
anomaly detection in flight sequence data is that of 
anomaly detection. Anomaly detection in computer 
systems typically involves forming a model of normal 
behavior, and flagging any variation from the ’nor- 
mal’ model as an anomaly. The problem of anomaly 
detection in aircraft flight data is thus, by its nature 
related to the area of anomaly detection. 

The approach we use to detect outliers is similar 
to some sequence analysis approaches for anomaly 
detection [7, 10]. Though superior results have been 
reported using the LCS metric in anomaly detection 
system [7], the same papers have also reported the 
cost of computing the LCS as a deterrent. Here, we 
present a faster algorithm for computing sequence 
similarity using LCS, which can help improve the 
speed of LCS based anomaly detection system. The 
amount of manual work required to be done by a sys- 
tem analyst is a large concern in the area of anomaly 
detection. The latter part of the paper present al- 
gorithms that accurately identify anomalous events 
inside a sequence. These algorithms can easily be 
adapted to the requirements of an anomaly detection 
system, and automate the task of identifying anoma- 
lies to a much larger extent, thereby cutting down 
the work required to be done by an analyst. 

The area of sequence analysis and comparison owes 
much of its growth from its practical application in 
the area of bioinformatics. In this way, the new al- 
gorithms we present in this paper for sequence com- 
parison and analysis are also related to the area of 
bioinformatics. 

2 Outline of Approach 

The main steps for the approach we use for anomaly 
detection in flight data is described below: 

1. Cluster the sequences into groups, using the nor- 
malized common subsequence metric as the sim- 
ilarity measure. The clustering algorithm we use 
is the CLARA [1] k-medoids algorithm. 

2. For each cluster, identify a certain percentage of 
the outliers as the anomalous sequences. 

3. Do an analysis that tries to identify where and 
how the sequence deviates from normal behavior. 

Simply classifying a sequence as anomalous or non- 
anomalous is not sufficient from the point of view of 
the end-user, i.e., the aircraft data analyst. Due to 
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the large number of flights that fly each day, and 
the large amount of data they generate, identifying a 
certain percentage of flights as anomalous still leaves 
too much data to be processed by the analyst. Hence, 
we developed algorithms that analyze the anomalous 
sequences, in order to provide the analyst with a rea- 
sonable explanation of why a particular sequence was 
considered anomalous. 

3 The Hybrid LCS Algorithm 

We use the normalized longest common subse- 
quence (nLCS) as the similarity measure for compar- 
ing flight sequences. Thus, given two sequences X 
and Y, of lengths m and n respectively, we calculate 
the normalized LCS by the formula: 


length(LCS) 



Given two sequences X and Z, Z is a subsequence of 
X if removing some characters from X will produce 
Z. Z is a common subsequence of two sequences X 
and Y if Z is a subsequence of X and Y. The longest 
such subsequence between A and C shall be called the 
longest common subsequence (LCS). LCS is a very ef- 
fective metric as it is able to detect similarity between 
two sequences without restricting itself to a location- 
based one-to-one match. The LCS metric has an op- 
timal substructure property, which is the foundation 
of the well-known dynamic programming algorithm. 
Given two sequences X[m] and Y[n], the algorithm 
constructs a two-dimensional table L(m,n). An entry 
L(i,j) in the table gives the length of the LCS between 
the first i characters of X and the first j characters 
of Y. Figure 1 shows the table L constructed by the 
algorithm for two sample sequences. More informa- 
tion on the optimal substructure property and the 
dynamic programming algorithm can be found in [2] . 

The time complexity of the algorithm is m-n, 
which makes the algorithm very expensive over a 
certain sequence length. A vast amount of literature 
[3, 4, 5] exists that considers algorithms that attempt 
to improve the bounds on the computational time. 
A survey on LCS algorithms by Bergroth et al 
[6] divides such algorithms into three groups and 
compares their results empirically. The comparison 
turns up no clear winners. However, the group 
consisting of the algorithm introduced by Hunt and 
Szymanski [3] (and its variants), remain the most 
popular of the group, perhaps because they are 


relatively easy to implement. 


3.1 The Hunt-Szymanski Algorithm 

For completeness, we give the Hunt and Szymanski 
algorithm below. The algorithm we describe below 
gives just the length of the LCS. It does not give 
the actual LCS sequence, as we only need the 
length of the LCS to measure similarity. For the ver- 
sion of the algorithm that also gives the LCS, see [3] . 

Algorithm 1: The Hunt-Szymanski Algorithm. 

Input: Sequences X[m] and Y[n], 

Output: Length of LCS. 

Step 1: For each character in X, find where it occurs 
in Y. Store the information as a decreasing linked list 
array. 

for i:= 1 to m 

Set matchlist[i]:= < ■ ■ ■ ,jp >, such that 

ji > ji ■ ■ ■ > j P and X[i\ = Y\j q ], for i < q < p. 

Step 2. Initialize the thresh array. Set all values 

to n+1. 

for i:= 1 to n 

thresh[i] := n+1. 

Step 3. Compute thresh values through m iterations, 
for i:= 1 to m 

for each j on ma tchlist[i] 

Find k s.t. thresh[k — 1] < j < thresh[k]. 
if j < thresh[k] 
thresh[k]:=j. 

Step 4. The lcs length is the largest k such 
that threshfk] was updated, 
k := largest k such that thresh[k\ ^ n+1. 
return k. 

One way to understand the algorithm [3] is in 
term of successive computation of threshold values. 
Threshold TA = the smallest j such that X[l:i] and 
B [1 :j] contain a common subsequence of length k. At 
the end of iteration i, the array thresh contains the 
values Ti t Thus, at the end of m iterations, the 
LCS length is k such that T[k] has a valid value. 

Another way to look at the algorithm is in terms of 
the two-dimensional table L created by the standard 
dynamic programming algorithm and how the Hunt- 


3 



\ *1 
N 

1 

A 

7 

G 

a 

T 

4 i 
G 

$ 

G 

6 

c 

> 

T 

« 

c 

9 

G 

X) 

T 

it 

T 

*? 

A 

G 

0 

1 

1 

1 

1 

1 

1 

1 

1 

1 

1 

1 

A 

2 

1 

■1 

1 

1 

1 

1 

1 

1 

1 

1 

1 

2 

G 

3 

1 

2 

2 

2 

2 

2 

2 

2 

2 

2 

2 

2 

G 

4 

1 

2 

2 

3 

3 

3 i 

3 

3 

3 

3 

3 

3 

T 

1 

2 

3 

3 

3 

3 

4 

4 

4 

4 

4 

4 

G 

5 1 

1 

2 

3 

4 

4 

4 

4 

4 

5 

5 

5 

5 

C 

T 

h — — 1 

1 

2 

3 4 

4 

5 

5 

5 

5 

5 

5 

5 

A 

1 

2 

3 4 

4 

5 

5 

5 

5 

5 

& 

6 


Figure 1: Example: Longest common subsequence 
table. 


Szymanski algorithm relates to it. We first cover 
some standard definitions: 

Given two sequences X and Y, (i. j) is said to be a 
match if X[i\ = Y\j\. 

A match (i,j) is called a k-match ( or is said to have 
rank k) if the location L(i,j) = k. 

A k-match (i,j) is called a k- dominant-match if, for 
all other pairs ( i',j ') of rank k, either i' > i and 
j' < j, or i! < i and j' > j. 

The broken line between all (k-l)-matches and k- 
matches is called a k- contour-line. For example, in 
Figure 1, (8, 6), (7, 6-9), (6, 9-12) form a 5-contour- line. 

Based on these definition we define a new term, a 
k-dominant contour point: 

A location (i,j) in table L is k-dominant contour 
point for line i if, for all other location pairs (i,f) 
with value k, j' > j. Notice that a k-dominant con- 
tour point does not have to be a match. For example, 
location (5,2) is a 2-dominant contour point, but it is 
not a match. 

In light of these definitions, we can say that, at 
the end of i iterations, the array thresh contains the 
location of the k-dominant contour point in row i at 
location thresh[k]. 

The Hunt-Szymanski(HS) algorithm for two se- 
quences X[m] and Y[n] can now be understood as 
follows: 

Algorithm 2: The Hunt-Szymanski Algorithm in 

terms of dominant contour points. 

For i = 1 to m, 


For each match ( Xi,yj),yj in decreasing order, 
if yj lies between the k th 
and (k + l) st dominant contour point, 
move the ( k + l) st dominant 
contour point location to yj . 

For example, suppose we are processing X[3] = ’G’ 
in example figure 1, using the HS algorithm. Look- 
ing at the indices backwards, we find the first match 
for G at location 9. We look back and see that this 
match lies between the current l-match(at 1) and 2- 
match(at 12). The algorithm does this by searching 
through ’thresh’ in step 3. As the match (3,9) lies 
between the current 1 and 2-dominant contour point, 
we move forward the 2-dominant contour point to 9. 
For the table L, this can be done by setting all values 
between 9 and 12 to 2. The HS algorithm does this 
by setting thresh[2] from 12 to 9. The reason why 
both these algorithms try to move dominant contour 
points forwards is to create space for more dominant 
matches. The later a dominant contour point occurs 
in sequence Y, with respect to sequence X, the less 
space is there for more matches of X with Y. 

The computation complexity of the HS algorithm 
can now be calculated as follows. Assuming both se- 
quences are of equal length(n), it searches through 
array thresh of length n, r times, where r is the total 
number of matches between the two sequences. Ig- 
noring the time required to create the index, the com- 
plexity of the HS algorithm is given by 0{r ■ logn). 

If o is the set of unique characters(symbols) over 
both sequences X and Y, and if there are k such 
unique characters, the value of r is given by 

where, 77 ( cr j v ) represents the count of character <j t 
in sequence X. 

As is clear from the expression, the value of r is 
small when the 0 frequencies are small, or when the 
frequency distributions for <r A and cA are different. 
However, if the sequences being compared are drawn 
from the same dataset and represent the same phe- 
nomenon, this is unlikely to be the case. 

In fact, in most areas where the LCS metric is used, 
these conditions seldom hold. For example, in bioin- 
formatics, the frequency distribution of characters is 
uniform and the character size is small(4-8), lead- 
ing to large values of 77(07). In anomaly detection 
[7, 8], system calls and user commands frequencies 
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are better modeled as Zipf-like distributions than as 
uniform distributions. In Web Usage Mining [9], an- 
other area where LCS can be used as a measure, web 
traffic is known to show a Zipf-like frequency distri- 
bution. Again, in text matching, language characters 
and words are known to follow a Zipf-like distribu- 
tion. Clearly the HS algorithm does not perform opti- 
mally for Zipf-like distributions, where certain ? 7 ( cr.j ) s 
are very large compared to others. 

A number of methods have been devised to reduce 
the actual value of r in the term 0(r ■ logn). A sim- 
ple one, which works very well in practice, will be to 
search through ’thresh’ only if we crossed a dominant 
contour point between two successive index locations. 
This can be done simply by first checking if the lo- 
cation k updated in the previous iteration satisfies 
thresh[k — 1] < j' < thresh[k\ , where j' is the cur- 
rent index location being processed. For characters 
with a large value of r, this cuts down the number of 
searches by quite a margin. However, this still means 
we have to search through the entire thresh array for 
each search. This search is more difficult to optimize. 

3.2 The Hybrid Algorithm 

One of the reasons that the HS algorithm does not 
perform better in many cases is because its data 
structure (the thresh array) provides it with very little 
information. We next present a new algorithm that 
attempts to make up for this drawback of the HS 
algorithm. The algorithm may be considered a hy- 
brid of the HS algorithm and the standard dynamic 
programming algorithm[2]. This is because, it es- 
sentially does what Algorithm 2 above does, like the 
HS algorithm. However, it maintains a different data 
structure, an array ’row’ which, at any iteration i of 
the algorithm, corresponds to the row L(i, l:n) of the 
standard dynamic programming algorithm. 

We give the hybrid LCS algorithm below: 

Algorithm 3: Hybrid Algorithm for LCS. 

Input: Sequences X[m] and Y[n]. 

Output: Length of LCS. 

Step 1: For each character in X, find where it occurs 
in Y. Store the information as a decreasing linked list 
array. 

for i:= 1 to m 

Set matchlist[i]:= < jJ, ■ ■ ■ ,j P >, such that 

j l > J 2 • • • > j P , and X[i] = Y[j q ] for i<q<p. 


Step 2. Initialize the row array. Set all values 
to 0. 

for i:= 1 to n, row[i] := 0. 

Step 3. Compute L(l,l:n) at each iteration i. 

Set prevJndex := n. 
for i:= 1 to m 

for each j on matchlist[i] 

Set row[j] := row[j] + 1. 

Step 3. a. Check if a dominant contour point 
exists between the two match points. 

If it does not, skip 
to next iteration, 
if row[j] = row [prevJndex] 
prevJndex = j. 
skip. 

Step 3.b. If a dominant contour point exists, 

End it and move it to j. 

End j < k < prevJndex s.t. 
row[k] = row[j] + 1. 
if 3k 

row[j+l . . . k] = row[j] + 1. 
prevJndex = j. 

Step 4. The last value in row gives the lcs length, 
k := row[n]. 
return k. 

Let us now see how the hybrid algorithm pro- 
cesses X[3] = ’G’ in example figure 1. At the 
start of iteration 3, ’row’ will look exactly like 
L(2,:) of the example. The algorithm will find the 
match j at location 9. It will look at the value 
of row(prevJndex) (prevJndex set at the beginning 
of Step 3). As it is greater than row[j], it will 
search through (j, prevJndex) for the dominant con- 
tour point. It will find it at 12, and increment 
row[j + 1 . . . prev index], thus moving the dominant 
contour point to j. It will then set prevJndex to j, 
and move to the next location, and so on. 

The complexity of the algorithm can be calculated 
as follows: We ignore the time taken to create in- 
dex, as with the HS algorithm. Step 2 takes linear 
time. Step 3. a is executed r times. Assuming a uni- 
form distribution of a, we can assume the matches 
r are distributed uniformly across row. In that case 
step 3.b is executed n • (min(r / n, count (dominant 
contour points))). As the maximum possible number 
of dominant contour points in a row is equal to the 
LCS of the sequence, we can say that Step 3b is ex- 
ecuted n • (min(r /n, LCS)) times. Each time we do 
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a binary search through a small portion of the sorted 
array row. Assuming that the matches are evenly 
distributed across each row, we say that the average 
length of array row is log(?r 2 /r): 

2 

T TL 

Oin ■ mint — , LCS) ■ log( — )) 
n r 

It is obvious that this value shall generally be much 
smaller than r • logn, and can never be larger. 

It must be remembered that the time complexity 
computed is only in terms of number of comparisons. 
The algorithm performs more read-writes than the 
HS algorithm. In the worst case, the time complex- 
ity of the number of read- writes performed is 0(n 2 ). 
However, we find that the cost of memory writes is 
negligible in practice. Also, as most of the writes con- 
sist of the same value written over contiguous areas 
inside an array, a clever programmer can easily find 
many ways to optimize these writes. But our current 
implementation makes no such optimizations. 

Table 1 and 2 compare the computation running 
time of the LCS algorithms described above. The 
test data we use is generated exactly the methodol- 
ogy followed in the survey paper on LCS algorithms 
by Bergroth et al [6]. More information on the test 
data generation methodology can be found in Sec- 
tion 5. Our results show that the hybrid algorithm 
runs many orders faster than the HS algorithm for all 
tests. 

4 Outlier Analysis 

For data embedded in a vector space, or following a 
known statistical distribution, once the outliers have 
been identified, the reasons why these particular data 
points were considered outliers is generally easy to 
answer. For example, in the vector space model, if a 
data point is an outlier, it is because it has an abnor- 
mally high/low value along one or more dimensions. 
Such properties are easy to detect. However, in the 
case of sequential data, even after the identification 
of outliers, it is often difficult to say, by simple obser- 
vation or analysis, why the sequence was considered 
an outlier. 

However, simply declaring a sequence as anoma- 
lous is not very useful for an analyst, particularly 
if the sequences are long. Manually analyzing a se- 
quence to discover why it was considered anomalous 
is a taxing and time-consuming activity. This section 
describes algorithms that attempt to automate this 


activity as far as possible. Our system, once it classi- 
fies a sequence as anomalous, analyzes this sequence 
further, and is expected to provide detailed informa- 
tion to the analyst about the atypical events inside 
the flight sequence. The information the system tries 
to provide the analyst with may be classified into (1) 
A sequence of switches was expected at a given stage 
but were not flipped. (2) A sequence of switches was 
not expected at a given stage but were flipped. (3) A 
sequence of switches were flipped in the wrong order. 

The approach we take to identify such anomalies 
inside a sequence consists of the following steps: 

Define/derive an objective function to maximize 
over the outlier sequence. We define an objec- 
tive function F for the outlier sequence, maximizing 
which, we believe, will maximize the likelihood that 
the sequence is not an outlier member of its cluster. 
A reasonable objective function is the average, or the 
weighted mean, of the similarity score(nLCS) of the 
outlier sequence with all the sequences in the clus- 
ter. Alternatively, we construct a generative model 
for the sequences in a cluster, and derive F from this 
model. 

Identify possible changes to the sequence. We next 
try to identify the changes (insertions /deletions) that, 
if made to the outlier sequence, will maximize this 
objective function F. 

We now divide anomalies to be identified inside 
a sequence into two categories, and define them in 
terms of the objective function O. 

Non-essential Character: A character at a loca- 

tion in a sequence is said to be non-essential to 
the sequence if its removal from that location will 
improve the objective function score F for the 
sequence, in relation to its cluster. 

Missing Character: A character is said to be 

missing from a particular location in a sequence if its 
addition at the location will improve the objective 
function score F for the sequence, in relation to its 
cluster. 

The problem of finding the anomalous areas in a 
sequence can thus be divided into two parts: a) find 
all the non-essential characters inside the anomalous 
sequence, and b) find all the missing characters from 
the sequence. 

We next discuss the objective function F and the 
different ways to construct it. 
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Sequence Lengths (characters) 

Algorithm Name 

500 

1000 

2000 

4000 

10,000 

20,000 

40,000 

Hybrid Algorithm 

0.001 

0.004 

0.02 

0.06 

0.50 

3.30 

14.00 

Hunt-Szymanski Algorithm 

0.006 

0.02 

0.06 

0.25 

1.60 

8.00 

33.40 

Std. Dynamic Programming 

0.007 

0.03 

0.14 

0.60 

3.90 

17.00 

114.50 


Table 1: Running time of LCS algorithms(seconds)- a distribution is Zipf 



Sequence Lengths (characters) 

Algorithm Name 

500 

1000 

2000 

4000 

10,000 

20,000 

40,000 

Hybrid Algorithm 

0.002 

0.008 

0.036 

0.14 

1.21 

4.90 

27.20 

Hunt-Szymanski Algorithm 

0.01 

0.038 

0.166 

0.52 

4.85 

20.90 

96.00 

Std. Dynamic Programming 

0.009 

0.032 

0.14 

0.64 

5.10 

18.10 

109.20 


Table 2: Running time of LCS algorithms (seconds)- a distribution is uniform 


4.1 The Objective Function 

4.1.1 Bayesian Network based model of F 

A generative model for a cluster can be constructed 
in terms of a simple Bayesian Network. As part of 
this network, each of the sequences in the cluster is 
generated by the centroid(identified during the clus- 
tering stage) by a certain probability. The probability 
of generation of a sequence from the centroid can be 
taken as proportional to the value of the normalized 
LCS score between the sequence and the centroid. 
The outlier sequence can be considered as being gen- 
erated by each of the sequences of the cluster by a 
certain probability. We assume that the probabil- 
ity of generation of the outlier from each sequence 
is proportional to the normalized LCS score between 
the sequence and the outlier. 

We now calculate the changes required to be made 
to the outlier sequence, that maximize the probabil- 
ity that the outlier sequence was generated from the 
cluster. 

Let C be the centroid, the sequences inside the clus- 
ter be represented by Si, S 2 , ■ ■ ■ ,s n , and O be the 
outlier. In that case, we want to maximize P(C/0). 


Since P{0\Si) oc nLCS{0,Si) and P(Si\C) oc 
nLCS(C, Si), we get 

N 

P{C\0 ) oc 53 nLCS{0, Si) ■ nLCS{C , 5,) 

i— 1 

Hence, in the case of a Bayesian Network based 
model for a cluster, the objective function to be max- 
imized is given by 


N 


F'(0, C) = J2 nLCS(0, Si) ■ nLCS(C, Si) 


N 


LCS(Q,Si) LCS (Si, C) 
Cl \fio ■ Is, \flSi ■ lc 

N 


= £ 
i— 

1 

^ l O ■ l C Cl 
N 


£ 


lcs(0,Si)-lcs(Si,C) 

£ 


P IcsjO, Si) ■ lcs(Sj, C) 


VTo Ci ls < 

Ignoring /3, the objective function is given by 


P(C\0) a P(0 A C) 


N 

P{0 A C) = P(C) 53 P(0\Si) ■ P(Si\C) 

i = 1 
N 

oc 53 P(0\Si) • P(Si\C) 

2=1 


1 flc S (0,S,).te(S„C) 

mc) = 7C'A S 

In general, a Bayesian Network based model based 
objective function is more effective when the cluster 
is large can be said to contains small sub-clusters, as 
the Bayes net model optimizes with respect to the 
sequences most similar to the outlier. 
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Figure 2: Time to compute LCS - Zipf Distribution 


Suppose we removed a character c from a location 
1 inside outlier sequence O. The impact would be as 
follows: 

1. The length of the outlier sequence would de- 
crease by 1. 

2. For the sequences Si where c is part of their LCS 
with O, the LCS value will decrease by 1. 

3. The LCS value will remain unchanged for all 
other sequences in the cluster. 

The objective function is given by: 


4.1.2 Mean / Weighted Mean based Objec- 
tive Function 

Given an outlier sequence O and a cluster C, a 
weighted mean based objective function F can be 
given by: 


N 

F(0, C) = J2 x i‘ nLCS(0, Si) 

i — 1 

Here N is the number of sequences in the cluster. 
The weight A can be set as equal for all sequences, 
or alternatively, it may be set as proportional to the 
score the sequence with the centroid. 

Expanding the value of nLCS(0, Si), we get 


F(0,C) = 


1 N 


lcs(0,Si)-lcs(Si,C) 


Let the new F(0,C) as a result of this change be 
represented by F’(0,C). F’(0,C) is given by: 


F'(0 C) = 1 ^ r ( lcs(Q , Sj) — 1) • lcs(Sj,C) 


Vb — \ 


c£lcs(0,Sj) 


v— ' lcs{0, Sk) ■ lcs(Sk, C) 1 

2^i F J 

c0cs(O,S k ) ** 


Vlo — 1 


N 

{E 

i—1 


lcs(0,Si)-lcs(Si,C) 

TZ 


F(0,c) 


1 ^ N LCS(0, Si) 

vio'h l ' 


E 

c£lcs(0,Sj ) 


lcs{Sj,C ) 1 

h 1 


It should be better to use the weighted mean based 
objective function when the clusters are small and ho- 
mogenous, as it assumes that the outlier should re- 
semble all sequences in the cluster equally closely, and 
enforces a very stricter homogeneity on the outlier, so 
that even slight deviations may count as anomalies. 


Le . £ 

ceics(o,Sj ) Sj 

Replacing b c and F(0,C) in the equation for 
F’(0,C) gives: 


4.2 Maximizing the objective function 

We discuss the algorithms used to find the changes 
that would maximize the objective function, based on 
the objective function derived for the Bayesian Net- 
work based model of a cluster. The same algorithms 
can be applied, with minor modifications, to the case 
where the objective function is defined as a weighted 


F\0, C) = -E= • (Vb ■ F(0, C) - b c ) 

It can be shown that, given k characters at different 
locations, all with same value of b c = b: 

F '{°, C) = ■ (• sflo ■ F(0, 0-k-b) 


mean. 




4.2.1 Detecting non-essential characters 

We can now present a simple greedy algorithm to 
find all characters removing which shall improve 
the objective function score. The algorithm is given ® 
below: j 

E 

Algorithm 4: Non-essential character Detection | 

Algorithm. f 

2 

Input: Outlier sequence O and cluster C with 
sequences < Si, S„ > 

Output: N ,the list of non-essential characters in O. 

Step 1 -.Declare array b[l], where 1 is the 

length of O. 

for i:= 1 to 1, b[i] := 0. 



Step 2:Calculate F, and b c for each character c in O. 
for i:= 1 to n 

Get the LCS of O with Si. For each 
c € LCS(0, Si), 

Set 6[c] := b[c] + lcs(0, Si) /Is,- 
Set F = F + lcs{0 , Si) ■ lcs(C, Si). 

Step 3: Find the next character to be replaced. 

Find b = min (b c ). 

Find all C, such that, b c = b. 

Set k := Size(C). 

Step 4: Calculate new value of F. 

Set F_old = F. 

f = 71=1 
If F > F-old 
Add c € C to N. 

Set b[c] = max+1, for all c € C. 

Go to Step 3. 


Figure 3: Time to compute LCS - Uniform Distribu- 
tion 


Here the value of b ct7 is given by the following ex- 
pression: 


b 


C<7 


E 


lcs{Si,C) 

Is, 


Here m is the number of sequences in the cluster 
will contain the symbol cr as part of their LCS with O, 
if it is added between location c-1 and c of sequence 

O. 

Similarly, adding k symbols to a sequence at var- 
ious locations, such that for all k symbols, b ca = b, 
will give change F as follows: 


Step 5. All non-essential characters are stored in N. 
Return N. 


4.2.2 Detecting missing characters 

This section focuses on the problem of finding the 
missing characters from a sequence. An analysis sim- 
ilar to the one above shows that adding a character 
cr to a location c in an outlier sequence O sequence 
changes the objective function F as follows: 

F'(0, C) = -^== -(Vb-F(0,C) + b crT ) 


F'(0, C) = -7==== • (s/lo ■ F(0, C) + k ■ b 
Vlo + k 

We provide below a greedy algorithm based on the 
above formulae: 

Algorithm 5: Missing characters Detection Al- 

gorithm. 

Input: Outlier sequence O and cluster C with 
sequences < Si , . . . , S n > 

N symbols '■ the total number of unique symbols 
in the sequence. 

Output: M ,the list of missing characters in O. 

Step l-.Declare array b[l], where 1 is the 
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length of O. 
for i:= 1 to 1 

for j:=l to N aymbo i s , b[i][j] := 0 

Step 2:Calculate F, and b cp for O. 
for i:= 1 to n 

Get the LCS of O with Si. 

Foreach Cj G LCS(0, Si), 

Set b[c] [/c] := b[c] [fc] + IcsiO , Si) /Is, ■ 
where k occurs in Si between Cj-\ and Cj. 
Set F = F + lcs(0 , Si) ■ lcs[C , Si). 

Step 3: Find the next set of missing characters. 
Find b = max (b). 

Find C = all (i,j) pairs, such that, b(i,j) = b. 

Set k := Number of such (i,j) pairs. 

Step 4: Calculate new value of F. 

Set F_old = F. 

F =7rhl-^- F + k - l ’> 

If F > F-old 
Add (i,j) G C to M. 

Set b(i,j)=0 for all ( i,j ) € C. 

Go to Step 3. 

Step 5. All missing characters and their locations 
are stored in M. 

Return M. 


4.2.3 Reconstructing missing character se- 
quences 

The algorithm given above can only detect the sym- 
bols that should be inserted between two characters 
in a sequence, but cannot detect the order in which 
they should be inserted. For example, suppose we 
have a symbol set A,B,...,F. The algorithm might, in 
that case, be able to say that characters D and E 
should be inserted between characters A and C in a 
given outlier sequence, but not whether the final se- 
quence should look like ’ADEC’ or ’AEDC’. We need 
a more sophisticated algorithm in order to do that. 

There are two possible approaches towards an algo- 
rithm which shall detect the exact sequence in which 
characters should be inserted, along with the char- 
acters to be inserted. The naive approach would 
be to actually insert the character in the outlier se- 
quence each time the addition algorithm prescribes 
an addition, and recalculating the array b accord- 
ingly. Updating b might involve re-comparing the 
modified outlier sequence with all the sequences in 


the cluster. 

We present below an alternative approach, which 
minimizes the number of comparisons between the 
outlier sequence and the cluster, but allowing extra 
characters to creep into the sequence. However, at 
the end, we make a call to Algorithm 3 above, which 
detects and removes these extra characters. 

The algorithm is based on the observation that, for 
a given sequence of length 1, adding a symbol sigma 
before a character c will increase the score F of the 
sequence with respect to the cluster only if the value 
of b ca is above the certain threshold. This threshold 
can be derived to be: 

b ca > T = (VT+1 -yfl)-F 

If every insertion we make to the sequence is 
a valid insertion, that is , it is greater than the 
threshold described above, the value of this thresh- 
old increases with every insertion. Thus, at a given 
sequence length 1, if we pick all (c, a) pairs such that 
b ca is greater than the threshold at 1, we have picked 
all the possible candidates for insertion in the future, 
as the threshold value can only increase. Due to lack 
of space, we only give a short outline of the algorithm. 

Algorithm 6: Algorithm to compute missing 

character sequences. 

Step 1: Given an outlier sequence O of length 

1 and cluster C, calculate the array b(l,p(a)) and the 
current value of F, as in Step 2 of algorithm 5. 

Step 2: Calculate the current value of the threshold 

T. 

Step 3: Set O' := O. Set b' = b. Find all 
Candidates = b(i,j) > T. 

Step 4: Before each row i of b, find character 

op if it exists, such that b{i,j) G Candidates and 
b(i,j) = max(b(i th row). Insert <jj in O', at the 
location in O' for which b' has the maximum value 
for (7 j , between the rows corresponding to i-1 and i 
in b. Set b(i,j) = 0. 

Step 5: Calculate b’{l,ri{a)) for the new O, , 

as in Step 2 of algorithm 5. Go to Step 4. 

Step 6: Call Algorithm 4 with O and C, to 

remove inessential characters that were added. 
Compare O 0 id a nd O to get the list of additions. 
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Essentially, Step 4 and 5 combine to find what the 
final sequence would look like if all the candidates 
were inserted into the sequence. These steps work as 
follows: 

Suppose we have an outlier sequence ABCD, and 
step 3 finds characters E,F and G, above T between 
A and B, such that b(B,E) > b(B,F) > b(B,G), and 
I > T between C and D(here b’(B,E) is read as the 
score in array b’ for character E, for its insertion be- 
fore location B in ABCD). In the first iteration, the 
algorithm will insert E between A and B, and I be- 
tween C and D. It will then recalculate the b ca ar- 
ray for AEBCID. It will then check if F has a higher 
score before E or after E, and insert F accordingly. 
Suppose the new sequence is AFEBCID. It will then 
check whether G has a higher value in b’ between AF, 
FE, or EB and insert G there. The final sequence may 
look like AFGEBCID. 

Step 6 calls Algorithm 4 to remove the extra char- 
acters from O’. Now, supposing it was found that F 
was a non-essential character in the sequence and re- 
moved. However, we know that G is still in the right 
location because, supposing F had never been in- 
serted, b’(E,G) would still be greater than b’(B,G),as 
the same relation held when F was inserted between 
A and E, even though the insertion of F reduced the 
value of b’(E,G). 

4.3 Discussion: Optimality of Algo- 

rithms for insertion/deletion 

We say an algorithm for detecting non-essential / 
missing characters is optimal if it gives the exact set 
of deletions/insertions which will maximize the value 
of the objective function F. However, the greedy ap- 
proaches we presented above are optimal only when 
the values of b c / b ca do not change once they are 
calculated at the beginning of the algorithm, or do 
not change significantly enough to have an impact. 
However, this is not always be the case. The mar- 
gin of error may be acceptable in many cases, as 
it is in ours, as the purpose of our system is only 
to provide pointers to analysts about the potential 
problem areas given an anomalous sequence. How- 
ever, an approach to get the optimal value of F, if 
we are looking for non-essential characters, can be 
to call the algorithm 6 to find all missing charac- 
ters right after the algorithms for non-essential has 
been called, with the modified sequence as input. If 
it suggests an insertion of a character at a location 
where the character already existed before the clele- 



Figure 4: Sample: Generated graph for an atypical 
flight. 


tion algorithm was called, re-insert that character. 
However, again, it may be the case that the insertion 
algorithm makes some wrong suggestions. To cover 
all such cases, the algorithm to detect non-essential 
characters may be called again, and so on alternately 
till one of the algorithms makes no further sugges- 
tions of insertions/deletions. 

5 Experimental Results 

The two datasets used to compare the LCS algo- 
rithms was generated synthetically, using exactly the 
procedure used by Bergroth et al[6] in their survey, to 
facilitate comparison with existing LCS algorithms. 
The first dataset consisted of sequences drawn from 
a uniform distribution with a = 8, to simulate se- 
quences found in the bioinformatics domain. The 
second dataset was drawn from a Zipf distribution 
with a = 256, to simulate other common domains 
where LCS is in use. While [6] tested only with se- 
quences of length 4000, we tested over a much larger 
range of sequence lengths, between 500 and 40,000. 
The results of the comparison are presented in Ta- 
bles 1 and 2. Figures 2 and 3 provide plots of the 
same data, to facilitate comparison. As the results 
clearly show, the new hybrid algorithm outperforms 
the other two algorithm by many times. 

The clustering and outlier analysis algorithms were 
used over a dataset consisting of the landing phase 
sensor information for 6400 flights. The data was re- 
duced to a sequence by recording, at any given time, 
only sensors that changed their state. After this re- 
duction, the sequence dataset consisted of 6400 dis- 
tinct sequences, varying in length from 600 to over 
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9000. The number of distinct symbols (a) was around 
700. The average sequence length was approximately 
1500. 

The number of distinct clusters in the data was em- 
pirically estimated to be three, which was the value 
passed to the k-medoids algorithm. We expect the 
clustering step to not take more than 10 minutes in 
a C implementation. 

Our outlier analysis algorithms output the follow- 
ing information: A graphical representation of the 
anomalous areas inside an outlier sequence. A sam- 
ple graph for an atypical flight is presented in Fig- 
ure 4. The horizontal axis, read from left to right, 
represents the time remaining until the plane’s tires 
touched the runway. The positive direction of the 
vertical axis represents the suggested insertion, that 
is, switches the algorithm thinks the pilot should have 
pressed at that stage. The negative direction repre- 
sents the suggested deletions, that is, the switches the 
algorithm believes the pilot should not have pressed 
at that stage. The height of the columns indicates 
the confidence the algorithm has in its prediction, 
which is measured as proportional to the improve- 
ment in the score of the objective function F, if the 
insertion/deletion suggested was actually made. A 
graph with no bars would be representative of a com- 
pletely normal flight, as perceived by the sequence 
algorithm. The graph provides a simple visual inter- 
face that allows the analyst to focus his interest on 
the areas which are suspected to be most anomalous, 
and saving him the trouble of analyzing the entire 
sequence. A detailed report describing these anoma- 
lous areas. A report is generated in parallel by the 
system, which gives a list of switches that the algo- 
rithm believes should have been pressed/not pressed 
at that stage. 

6 Conclusions 

The paper describes a system designed with the aim 
of detecting anomalies in discrete flight data. It does 
so by clustering flight data sequences using the nor- 
malized longest common subsequence(nLCS) as the 
similarity measure. As the nLCS is an expensive 
measure to compute, we discussed algorithms, such 
as the Hunt-Szymanski algorithm, which have lesser 
runtime complexity. We also discussed why they do 
not work so well in actual practice, and presented 
a new hybrid algorithm, which is many orders faster 
than the Hunt-Szymanski algorithm. This is an im- 
portant contribution as the LCS is a commonly used 


algorithm in many areas and the running time of LCS 
algorithms is a frequent bottleneck. Following this, 
we presented algorithms, based on simple models of 
sequence clusters, that detect anomalies inside se- 
quences. In doing this, we step beyond what most 
current anomaly detection systems do, in not only 
predicting which sequences are anomalous, but by 
providing explanations as to why these particular se- 
quences were chosen as anomalous. Our approach is 
general and not restricted in any way to the aerospace 
domain, and we feel these algorithms can be of in- 
terest in areas such as anomaly detection and event 
mining. 
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